Final Portfolio

Author

Hugh Lyu

Published

July 24, 2025

About Me

Photo

Me with our team at Notre Dame Stadium celebrating the win

Biography

My name is Hugh Lyu, a senior at the University of Notre Dame double majoring in Economics and Finance. Originally from China, I came to the U.S. to challenge myself in a new academic and cultural environment. Over the past three years, I’ve built a foundation in data analysis, economic modeling, and financial theory—while actively applying those skills through internships in investment banking, private equity, and asset management across the U.S. and Asia.

I’m also the elder twin and a passionate football player. In 2022–2023, I helped lead my residence hall team to win the Notre Dame Interhall Championship. Football has been more than a sport to me—it’s where I developed discipline, resilience, and teamwork. In my free time, I train for speed and power, aiming to break personal athletic records while maintaining a consistent self-improvement routine. I’m currently preparing for graduate study in quantitative finance and applied economics, with a strong interest in macro-financial modeling and machine learning.

You can reach me at jlyu@nd.edu.


Week 1

In my first assignment, I analyzed the BMI of baseball players to understand how body composition (weight) might relate to performance, particularly in terms of runs scored. This assignment involved computing BMI from height and weight data, classifying players as overweight or not, and examining trends through summary statistics and a scatterplot.

Part B

In this part we want to examine the body mass index of the players.

  1. The Body Mass Index (BMI) can be calculated by multiplying the weight (in pounds) by 703 and dividing by the height (in inches) squared. Create a new variable in the data frame that gives the players’ BMI.

    # Load the dataset
    baseball <- read.csv("~/Desktop/Study/2025Summer/R/wk6/baseball_players.csv")
    
    # Calculate BMI using the formula: (weight * 703) / height^2
    baseball$bmi <- baseball$weight * 703 / (baseball$height ^2)
  2. A person with a BMI greater than or equal to 25 is generally considered to be “overweight”.  What percentage of the players would fall into this category?  (hint: use the mean function and a logical variable to find the percentage).

    # Calculate the percentage of players who are overweight
    mean(baseball$bmi >= 25) * 100 
    [1] 70.65481

    70.65% of the players would fall into this category.
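The hint’s trick works because R coerces logical values to numbers (TRUE is 1, FALSE is 0), so the mean of a logical vector is exactly the proportion of TRUE values. A minimal sketch with toy values (not the actual player data):

```r
# TRUE counts as 1 and FALSE as 0, so mean() of a logical
# vector gives the proportion of TRUE values.
bmis <- c(22.1, 27.5, 30.2, 24.9)  # toy BMI values
overweight <- bmis >= 25           # FALSE TRUE TRUE FALSE
mean(overweight) * 100             # percentage overweight: 50
```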

  3. A person with a BMI under 18.5 would be considered “underweight”.  What percentage of the players would fall into this category?

    # Calculate the percentage of players who are underweight
    mean(baseball$bmi < 18.5) * 100
    [1] 0

    0% of the players would fall into this category.

  4. Create a new variable in the data frame that is a logical variable indicating if a person is overweight. Convert this variable to a factor with labels for the levels “Normal” and “High_BMI”.  Create a table of this variable that gives the count of the individuals in the two categories. 

    # Create a logical variable indicating overweight status
    baseball$over_weight <- baseball$bmi >= 25
    
    # Convert the logical variable to a factor with labels "Normal" and "High_BMI"
    baseball$over_weight <- factor(baseball$over_weight, labels = c("Normal", "High_BMI"))
    
    # Display a table of BMI categories
    table(baseball$over_weight)
    
      Normal High_BMI 
         363      874 
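One subtlety in this step: factor() orders logical levels as FALSE then TRUE, so the labels must be supplied in that order for “Normal” to map to FALSE and “High_BMI” to TRUE. A small sketch with toy values illustrating the mapping:

```r
# factor() sorts logical levels as FALSE, TRUE, so the labels
# must be given in that order: FALSE -> "Normal", TRUE -> "High_BMI"
over <- c(TRUE, FALSE, TRUE, TRUE)
over_f <- factor(over, labels = c("Normal", "High_BMI"))
table(over_f)  # Normal: 1, High_BMI: 3
```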
  5. Create a scatterplot with BMI on the horizontal axis and runs scored on the vertical axis.

    # Create a scatterplot of BMI vs. runs scored
    plot(baseball$bmi,baseball$runs_scored)

  6. Does the data indicate that there is a relationship between BMI and runs scored? Explain in a few sentences.

From the scatterplot, we cannot find a strong correlation between BMI and runs scored: players’ runs-scored performance varies widely, from 0 to 100 runs, across the BMI range of 20 to 35. However, players with a BMI above 35 tend to score fewer runs overall.

Self Evaluation

This was my first homework assignment using R, and I didn’t yet know how to use a Quarto document properly, so I missed some formatting and reporting requirements like limiting the output to only the last 10 observations. Despite that, I made meaningful progress by learning how to create new variables directly within a dataset using the $ operator, which was a foundational concept for everything that came afterward. While the code was relatively simple, I began to understand how logic, transformation, and visualization fit together in data analysis in the R environment.


Week 2

In this assignment, I used the NHANES dataset (National Health and Nutrition Examination Survey), which contains health, dietary, and demographic information from representative samples of the U.S. population, to explore how height and weight are related across different ages. I also investigated whether the observed decline in average height among older individuals could be attributed to gender differences, specifically the common assumption that women live longer and are shorter on average. By separating the data by gender, I tested whether aging alone explains the trend.

Part A

The NHANES library in R contains results from the National Health and Nutrition Examination Survey (NHANES). This data set can be loaded with library(NHANES).

For this assignment we will make use of the NHANESraw data set included in this package. This data set includes a variety of variables about health-related issues.

Q1

Examine the variables in the NHANESraw data set by performing str(NHANESraw).

# Load NHANES dataset
library(NHANES)

# Examine the variables - using the head function to avoid large output
head(NHANESraw, 3)
     ID SurveyYr Gender Age AgeMonths Race1 Race3   Education MaritalStatus
1 51624  2009_10   male  34       409 White  <NA> High School       Married
2 51625  2009_10   male   4        49 Other  <NA>        <NA>          <NA>
3 51626  2009_10   male  16       202 Black  <NA>        <NA>          <NA>
     HHIncome HHIncomeMid Poverty HomeRooms HomeOwn       Work Weight Length
1 25000-34999       30000    1.36         6     Own NotWorking   87.4     NA
2 20000-24999       22500    1.07         9     Own       <NA>   17.0     NA
3 45000-54999       50000    2.27         5     Own NotWorking   72.3     NA
  HeadCirc Height   BMI BMICatUnder20yrs      BMI_WHO Pulse BPSysAve BPDiaAve
1       NA  164.7 32.22             <NA>    30.0_plus    70      113       85
2       NA  105.4 15.30             <NA>    12.0_18.5    NA       NA       NA
3       NA  181.3 22.00             <NA> 18.5_to_24.9    68      109       59
  BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3 Testosterone DirectChol TotChol
1    114     88    114     88    112     82           NA       1.29    3.49
2     NA     NA     NA     NA     NA     NA           NA         NA      NA
3    112     62    114     60    104     58           NA       1.55    4.97
  UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 Diabetes DiabetesAge HealthGen
1       352         NA        NA         NA       No          NA      Good
2        NA         NA        NA         NA       No          NA      <NA>
3       281      0.415        NA         NA       No          NA     Vgood
  DaysPhysHlthBad DaysMentHlthBad LittleInterest Depressed nPregnancies nBabies
1               0              15           Most   Several           NA      NA
2              NA              NA           <NA>      <NA>           NA      NA
3               2               0           <NA>      <NA>           NA      NA
  Age1stBaby SleepHrsNight SleepTrouble PhysActive PhysActiveDays TVHrsDay
1         NA             4          Yes         No             NA     <NA>
2         NA            NA         <NA>       <NA>             NA     <NA>
3         NA             8           No        Yes              5     <NA>
  CompHrsDay TVHrsDayChild CompHrsDayChild Alcohol12PlusYr AlcoholDay
1       <NA>            NA              NA             Yes         NA
2       <NA>             4               1            <NA>         NA
3       <NA>            NA              NA            <NA>         NA
  AlcoholYear SmokeNow Smoke100 SmokeAge Marijuana AgeFirstMarij RegularMarij
1           0       No      Yes       18       Yes            17           No
2          NA     <NA>     <NA>       NA      <NA>            NA         <NA>
3          NA     <NA>     <NA>       NA      <NA>            NA         <NA>
  AgeRegMarij HardDrugs SexEver SexAge SexNumPartnLife SexNumPartYear SameSex
1          NA       Yes     Yes     16               8              1      No
2          NA      <NA>    <NA>     NA              NA             NA    <NA>
3          NA      <NA>    <NA>     NA              NA             NA    <NA>
  SexOrientation WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA PregnantNow
1   Heterosexual 80100.54 81528.77       1       83        <NA>
2           <NA> 53901.10 56995.04       2       79        <NA>
3           <NA> 13953.08 14509.28       1       84        <NA>

Q2

Create a scatterplot for the “Age” and “Height” variables. Along with the scatterplot, include a few sentences that describe the relationship between age and height observed in the data.

# Create a scatterplot of Height vs. Age
library(ggplot2)
ggplot(data = NHANESraw, aes(x = Age, y = Height))+ geom_point() + labs(x = "Age (years)", y = "Height (cm)")
Warning: Removed 2258 rows containing missing values or values outside the scale range
(`geom_point()`).

Before age 18, there is a strong positive correlation between age and height, as children are still growing quickly. In adulthood, height remains fairly constant, with a slight decreasing trend at the upper end of the age range, after about age 70.

Q3

As you see from the scatterplot, the data includes individuals below 18 years of age. We want to limit ourselves to only individuals 18 years and older (including 18 years). Using the techniques you learned in week 1 create a new data frame that includes all variables but only those individuals who are over 18 years of age. Create a new version of your Age vs. Height scatterplot including only those who are 18 years of age and older. Include a straight line that shows the trend of this data.

# Filtering the data to people that are over the age of 18 

Adult <- NHANESraw[NHANESraw$Age >= 18,]

# Create a scatterplot of Height vs. Age with a linear model straightline
ggplot(data = Adult, aes(x = Age, y = Height))+geom_point() + geom_smooth(method = "lm") + labs(x = "Age (years)", y = "Height (cm)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 561 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 561 rows containing missing values or values outside the scale range
(`geom_point()`).

Q4

Examine the scatterplot of Age vs. Height for those over 18. There is a slight downward trend indicating that as individuals get older they tend to be shorter. One expert working with this data stated “Women tend to live longer than men. As a result older groups will include more women. Since women tend to be shorter than men there will be a downward trend in height for older people.” We would like to explore the experts theory with a graphic. Create a graphic that will help explore this theory. Summarize in a few sentences if the graph supports this theory.

# Creating a colored scatterplot, separating genders to examine the correlation of height and age with straightline models
ggplot(data = Adult, aes(x = Age, y = Height, color = Gender)) + geom_point() + geom_smooth(method = "lm", linewidth = 5) + labs(x = "Age (years)", y = "Height (cm)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 561 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 561 rows containing missing values or values outside the scale range
(`geom_point()`).

As the graphic shows, women do not become markedly more prevalent in the older age groups, and the trend lines for both genders decline roughly in parallel. This parallel decline suggests that aging itself, rather than a shifting gender mix, drives the downward trend in height. The data therefore do not fully support the expert’s claim.
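The expert’s theory could also be checked numerically rather than visually, by computing the share of women within each age decade. A minimal sketch with made-up data (the real analysis would use the Adult data frame and its Gender variable):

```r
# Sketch with synthetic data: does the share of women rise with age?
df <- data.frame(
  Age    = c(25, 34, 47, 52, 68, 71, 79, 80),
  Gender = c("male", "female", "male", "female",
             "female", "male", "female", "female")
)
df$decade <- 10 * (df$Age %/% 10)            # bucket ages into decades
share_female <- tapply(df$Gender == "female", df$decade, mean)
share_female                                  # proportion of women per decade
```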

Self Evaluation

This was the first time I worked with ggplot2, and it opened up a new way for me to visualize relationships in data. I learned how to add smoothing lines using geom_smooth(method = "lm"), customize axes and titles, and use color to distinguish subgroups. These techniques made the plots more readable and meaningful. The findings confirmed that aging itself—rather than gender—is the main driver of height decline in older groups. This exercise gave me confidence in using visual tools to support statistical analysis.


Week 3

This assignment focused on using the tidyverse toolkit, especially the pipe operator %>%, to streamline data manipulation and filtering. In Part C, I analyzed the Berkeley arrest dataset to examine two things: which weekdays had the highest frequency of arrests, and whether there were discrepancies between the reported age and the actual birth year of those arrested. This required combining the dplyr and lubridate packages.

Part C (5 points)

The police department in the city of Berkeley, California releases a list of all individuals arrested. The file Berkeley Arrest contains the arrest list for November through early December 2017. The data includes 16 variables about each individual who was arrested and the offense for which they were arrested.

Two of the variables include the subject’s date of birth and the date of the arrest.  The variables are recorded as character variables.

  1. Read the data into R. Provide a glimpse of the data frame.

    # Load the tidyverse package (crucial) and the berkley dataset
    
    library(tidyverse)
    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ✔ lubridate 1.9.4     ✔ tibble    3.3.0
    ✔ purrr     1.0.4     ✔ tidyr     1.3.1
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    berkley <- read_csv("~/Desktop/Study/2025Summer/R/wk6/Berkeley_Arrest.csv")
    New names:
    Rows: 185 Columns: 17
    ── Column specification
    ──────────────────────────────────────────────────────── Delimiter: "," chr
    (11): arrest_type, subject, race, sex, height, hair, eyes, statute, sta... dbl
    (4): ...1, arrest_number, age, weight date (2): date_of_birth, arrest_date
    ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
    Specify the column types or set `show_col_types = FALSE` to quiet this message.
    • `` -> `...1`
    # Take a look at the dataset
    glimpse(berkley)
    Rows: 185
    Columns: 17
    $ ...1                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
    $ arrest_number       <dbl> 21105, 21120, 21112, 21110, 21115, 21114, 21111, 2…
    $ arrest_type         <chr> "SUSP. OF FELONY", "COURT FILED (WARRANT)", "ON-VI…
    $ subject             <chr> "Teddy Oliver Webster", "Herbert Stephen Blue", "M…
    $ race                <chr> "Black", "White", "White", "Black", "Other", "Whit…
    $ sex                 <chr> "Male", "Male", "Male", "Male", "Male", "Female", …
    $ date_of_birth       <date> 1968-12-18, 1957-08-30, 1993-02-16, 1958-04-28, 1…
    $ age                 <dbl> NA, 57, 21, NA, 56, 23, 60, 27, 55, 29, 20, 76, 41…
    $ height              <chr> "5 Ft. 9 In.", "5 Ft. 8 In.", NA, "5 Ft. 10 In.", …
    $ weight              <dbl> 135, 140, NA, 170, 152, NA, 200, 185, 185, 140, 13…
    $ hair                <chr> "BRO", "BRO", NA, "BLK", "BRO", "BRO", "BLK", "RED…
    $ eyes                <chr> "BRO", "HAZ", NA, "BRO", "BRO", NA, "BRO", "HAZ", …
    $ statute             <chr> "1203.2 - F; 422; 10852;", "Warr - Out (F);", "484…
    $ statute_type        <chr> "PC; VC;", "PC;", "PC;", "PC;", "PC;", "PC;", "PC;…
    $ statute_description <chr> "Probation Violation : Felony; THREATEN CRIME W/IN…
    $ case_number         <chr> "2017-00067675", NA, "2017-00067861", NA, "2017-00…
    $ arrest_date         <date> 2017-11-06, 2017-11-08, 2017-11-07, 2017-11-07, 2…
  2. Identify which day of the week most arrests occurred and provide a bar chart to support your answer.

    # Convert the arrest date to Date format and extract the day of the week
    berkley <- berkley %>% mutate(arrest_date = ymd(arrest_date), day_of_week = wday(arrest_date, label = TRUE))
    
    # Count number of arrests in each weekday and create a bar chart
    berkley %>% count(day_of_week) %>% ggplot(aes(x = day_of_week, y = n)) + geom_col()

  3. Use the date of birth and the date of arrest to determine the age of the individuals at the time of their arrest. Add this variable to your data frame and call it “real_age”. Be sure to create this variable as an age as it would be given. For example, if I am over 25 but under 26 I will tell people I am 25. You can make use of the floor() function to round down to the nearest integer. Provide a summary() command of the variable you create.

    # Calculate the real age of people at time of arrest by subtracting birth year from arrest year
    berkley <- berkley %>% mutate(real_age = floor(year(arrest_date)-year(date_of_birth)))
    
    # Summarize the variable
    summary(berkley$real_age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      18.00   27.00   37.00   39.66   50.00   84.00 
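One caveat: subtracting birth year from arrest year overstates the age of anyone whose birthday falls after the arrest date, which likely explains some of the mismatches examined in the next question. A base-R sketch of an exact “age as it would be given” calculation (an alternative check, not the submitted code):

```r
# Exact "as given" age: year difference, minus one when the
# birthday has not yet occurred by the reference date.
exact_age <- function(dob, ref) {
  dob <- as.POSIXlt(dob)
  ref <- as.POSIXlt(ref)
  age <- ref$year - dob$year
  before_bday <- ref$mon < dob$mon |
    (ref$mon == dob$mon & ref$mday < dob$mday)
  age - as.integer(before_bday)
}

# Born 1968-12-18, arrested 2017-11-06: birthday not yet reached,
# so exact age is 48 even though 2017 - 1968 = 49.
exact_age(as.Date("1968-12-18"), as.Date("2017-11-06"))
```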
  4. The data set also includes a variable that is the age of the subjects. Some of those ages do not match the age you determined in the previous step. Provide a list of subjects in which the actual age you calculated differs from the age recorded by the officer by more than one year in either direction. Provide output that gives the names of the subject, their real_age and their age recorded by the officer. 

    # Identify subjects whose reported age differs from calculated age by more than 1 year
    berkley %>% filter(abs(real_age - age) > 1) %>% select(subject, real_age, age)
    # A tibble: 19 × 3
       subject                       real_age   age
       <chr>                            <dbl> <dbl>
     1 Herbert Stephen Blue                60    57
     2 Michael Joseph May                  24    21
     3 Anthony Wilfred Kerman              29    27
     4 Damon Lamont Jones                  47    43
     5 Gerald Arcos                        29    23
     6 Sara Sofija Antunovich              37    35
     7 Andrew Francis Supple               59    57
     8 Scotty Emmanuel Guess               62    60
     9 CHRISTOPHER RANDOLPH TORRENCE       44    41
    10 Edward Rae Mitchell                 53    46
    11 Farad Ami Green                     43    45
    12 Adan Mora Morfin                    64    61
    13 Christopher Cole Tabor              34    32
    14 Daniel James Blackbear              33    31
    15 Fredrick Arzell Chisom              56    51
    16 Adam Kenneth Jones                  45    40
    17 PRICE WHEELER                       46    42
    18 Louis Joseph Lawyer                 34    31
    19 Nicholas M Shelby                   26    20

Self Evaluation

This was the most time-consuming assignment for me, mainly because I was still adjusting to chaining multiple tidyverse functions together. I struggled with the wday() function at first: I wasn’t sure which variable to feed it, and had to revisit the asynchronous lectures to understand its correct usage. But once I got it right, the code made a lot more sense. I successfully created a bar chart showing that arrests peaked slightly on Fridays and weekends. This assignment pushed me to be more patient in debugging and built a much stronger foundation for using pipelines.


Week 4

This assignment had us practice spatial data visualization by integrating mapping tools with the ggplot2 and tidyverse packages. In Part B, I used geocoded data to plot a map of Wake County, North Carolina using the Stadia Maps API. I analyzed the distribution of homes based on square footage (SQFT) and overlaid the locations of local Costco stores to visually explore potential patterns between home size and store proximity.

Part B (8 points)

Consider again the Houses data discussed previously. The geocodes for these houses are available in the Houses Geocode data set.

  1. Open both the Houses and the Houses Geocode data in R. Perform a join to put these two datasets together. Provide a glimpse of the resulting data set.

    # Load housing and geocode datasets
    house1 <- read.csv("~/Desktop/Study/2025Summer/R/wk6/Houses.csv")
    house_geo <- read.csv("~/Desktop/Study/2025Summer/R/wk6/houses_geocode.csv")
    
    # Merge housing data with geocode data using an inner join
    house <- inner_join(house1,house_geo)
    Joining with `by = join_by(Zip, Address)`
    # Take a look at the new dataset
    glimpse(house)
    Rows: 100
    Columns: 14
    $ ID.        <int> 153159, 155943, 44627, 119645, 78570, 53841, 44951, 109325,…
    $ Year       <int> 1990, 1986, 1955, 1983, 2000, 1967, 1924, 1980, 1989, 1994,…
    $ SQFT       <dbl> 2102.1, 1740.0, 795.0, 1152.0, 1404.0, 1752.0, 1829.0, 1246…
    $ Story      <dbl> 1.00, 1.50, 1.00, 1.00, 1.00, 1.00, 1.50, 1.00, 2.00, 2.50,…
    $ Acres      <dbl> 0.77, 0.06, 0.30, 0.68, 39.38, 0.29, 0.25, 0.29, 0.25, 0.36…
    $ Baths      <dbl> NA, 3.0, 1.0, 2.0, 2.0, 1.5, 3.0, 2.0, 2.5, 3.0, 2.0, 1.0, …
    $ Fireplaces <int> 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
    $ Total      <dbl> 203200.0, 119096.0, 71666.3, 131103.0, 4904102.0, 144452.0,…
    $ land       <dbl> 44000.0, 22000.0, 28000.2, 50000.0, 4797750.0, 52000.0, 292…
    $ building   <dbl> 159200.0, 97096.0, 43666.1, 81103.0, 106352.0, 92452.0, 203…
    $ Zip        <int> 27603, 27604, 27610, 27616, 27519, 27604, 27607, 27511, 276…
    $ Address    <chr> "924 Shadywood Ln", "3655 Top of The Pines Ct", "428 Lansin…
    $ lon        <dbl> -78.62990, -78.56623, -78.60988, -78.55883, -78.89214, -78.…
    $ lat        <dbl> 35.64545, 35.83101, 35.77383, 35.87781, 35.84087, 35.82526,…
  2. To create a map of these houses we need a bounding box. Create an appropriate bounding box that will allow all of the data to be mapped.  Using your bounding box, create a map of the Wake County North Carolina area that shows the locations of the houses in the data set.  Use a “toner” map as your map layer.

    # Check the range of longitude and latitude
    range(house$lon)
    [1] -78.93433 -78.46608
    range(house$lat)
    [1] 35.61702 35.95495
    # Load the required library for spatial mapping - ggmap
    library(ggmap)
    ℹ Google's Terms of Service: <https://mapsplatform.google.com>
      Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service>
      OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles>
    ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
    # Register Stadiamaps API key (crucial, the key must be valid to fetch the map)
    register_stadiamaps(key = "72e9489c-8557-48d7-81d6-c859aa66bc3d")
    
    # Create bounding box for Wake County area
    my_box <- c(left=-79, right=-78.4,top=36, bottom=35.6)
    
    # Download Stamen toner map using the bounding box
    my_map <- get_stadiamap(my_box, zoom = 10, maptype = "stamen_toner")
    ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
    # Plot all homes on the base map
    ggmap(my_map)+geom_point(data = house, aes(x = lon, y = lat))

  3. Refine your map to include color coding for the variable “SQFT.” 

    # Add color mapping for house size (SQFT)
    ggmap(my_map)+geom_point(data = house, aes(x = lon, y = lat,color = SQFT))

  4. Refine your map of the SQFT variable by creating a facet grid of the map based on the Story variable.  Consider this map. Do you think visualization provides insights to the story behind the data?  Explain in a few sentences

    # Use facet_wrap to create separate maps by house story level
    ggmap(my_map)+geom_point(data = house, aes(x = lon, y = lat,color = SQFT))+facet_wrap(~Story)

    The visualization shows that one-story houses are the most common across the county and tend to have smaller square footage on average. As the story count increases, the points on the map become fewer (indicating fewer houses) and lighter (indicating larger homes). Notably, there is only one point in the 2.5-story panel, a house of roughly 4,000 square feet. In general, this visualization shows that taller houses are less common and usually larger.

  5. Costco warehouses are a popular shopping location. The data set “Costco” contains the street addresses for the three Costco warehouses in Wake County. Load this data into R and geocode their locations using tidygeocoder. Provide a glimpse of the resulting data set.

    # Load necessary library
    library(tidygeocoder)
    
    Attaching package: 'tidygeocoder'
    The following object is masked from 'package:ggmap':
    
        geocode
    # Load Costco location addresses and geocode them
    costco <- read.csv("~/Desktop/Study/2025Summer/R/wk6/Costco.csv")
    
    # Obtain latitude and longitude of costco stores
    costco <- costco %>% geocode(address = Address, lat = lat, long = lon, method = "osm")
    Passing 3 addresses to the Nominatim single address geocoder
    Query completed in: 3 seconds
    # Preview the dataset
    glimpse(costco)
    Rows: 3
    Columns: 6
    $ Address <chr> "7800 Fayetteville Rd", "1021 Pine Plaza Dr", "2838 Wake Fores…
    $ City    <chr> " Raleigh", " Apex", " Raleigh"
    $ State   <chr> "NC", "NC", "NC"
    $ Zip     <int> 27603, 27523, 27609
    $ lat     <dbl> 35.66883, 35.74665, 35.81807
    $ lon     <dbl> -78.69932, -78.82686, -78.62147
  6. Recreate your map of the SQFT variable (without facet wrapping) and include the locations of the Costco warehouses. Use a distinctive marker to indicate the locations. Consider using the shape argument in ggplot.

    # Plot housing data with SQFT coloring, Costco locations in red
    ggmap(my_map)+geom_point(data = house, aes(x = lon, y = lat,color = SQFT))+geom_point(data = costco, aes(x = lon, y = lat), color = "red")

  7. Recreate your map from the previous question, but make use of the terrain version of the Stamen map.

    # Recreate the combined map with terrain view and Costco markers
    my_box <- c(left=-79, right=-78.4,top=36, bottom=35.6)
    my_map <- get_stadiamap(my_box, zoom = 10, maptype = "stamen_terrain")
    ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
    ggmap(my_map)+geom_point(data = house, aes(x = lon, y = lat,color = SQFT))+geom_point(data = costco, aes(x = lon, y = lat), color = "red")

Self Evaluation

By the end of Week 4, I felt more confident using the pipe operator and building layered plots with ggplot2. This assignment expanded my skill set with several new tools: setting bounding boxes to constrain the map area, registering and using an API key with Stadia Maps, and leveraging facet_wrap() to display multiple visual comparisons. This project made working with geospatial data feel approachable, and it also made me appreciate how versatile R can be for real-world analysis.


Week 5

In the final week, we focused on applying statistical modeling techniques, such as linear regression, to real-world data. Using the Wake County housing dataset from earlier in the course, I built multiple linear models to evaluate how square footage (SQFT), lot size (Acres), and their combination influence housing prices.

Part A

  1. The variable SQFT gives the size of the home in square feet.  We would like to predict the Total price of homes based on the size of the home (SQFT).   Create a linear model that makes this prediction. Give the name “model_1” to this model.   Give a summary() command of this model. 

    # Load Wake County housing dataset
    house <- read.csv("~/Desktop/Study/2025Summer/R/wk6/Houses.csv")
    
    # Create a linear regression model to predict Total price using square footage (SQFT) and view it
    model_1 <- lm(Total ~ SQFT, data = house)
    summary(model_1)
    
    Call:
    lm(formula = Total ~ SQFT, data = house)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -190007  -91711  -64236  -26105 4689712 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  23807.07  126487.61   0.188   0.8511  
    SQFT           135.74      67.58   2.009   0.0473 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 483700 on 98 degrees of freedom
    Multiple R-squared:  0.03955,   Adjusted R-squared:  0.02974 
    F-statistic: 4.035 on 1 and 98 DF,  p-value: 0.04732
  2. From model_1 report the slope of the least squares line.  What does this value tell us?

    The slope is 135.74, meaning that each additional square foot of home size is associated with a projected increase of $135.74 in total price.

  3. From model_1 report the value of the R-squared value.  What does this value tell us?

    The R-squared value is 0.03955, meaning that only 3.955% of the variation in the total price of the houses is explained by square footage. This R-squared value is low, indicating that the linear relationship between home size and total price is weak in the full data.

  4. This data contains an outlier that has a Total value of approximately 4.9 million dollars.  Filter your data to create a new data frame that removes this observation.  Use the filtered data to recreate your linear model.  Give the name “model_2” to this model.  Provide the summary() command of this new model. 

    library(tidyverse)
    
    # Filter out an outlier with an unusually high price (~$4.9M)
    house_filter <- house %>% filter(Total < 4900000)
    
    # Create a new linear model without the outlier, view the summary of the new model
    model_2 <- lm(Total ~ SQFT, data = house_filter)
    summary(model_2)
    
    Call:
    lm(formula = Total ~ SQFT, data = house_filter)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -178469  -50742   -6400   25185  418136 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -75819.6    21764.1  -3.484 0.000744 ***
    SQFT           165.9       11.6  14.299  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 82960 on 97 degrees of freedom
    Multiple R-squared:  0.6782,    Adjusted R-squared:  0.6749 
    F-statistic: 204.5 on 1 and 97 DF,  p-value: < 2.2e-16
  5. Examine the R-squared value.  How did this value change between the two models?

    The R-squared value is now 0.6782, meaning that around 67.82% of the variability in total price is explained by home size (SQFT). Removing the outlier makes the relationship between size and price much stronger; the outlier was indeed distorting the accuracy of the analysis.

  6. The Acres variable gives the amount of land on which the house is located. The outlier that is approximately 4.9 million dollars is due to the large number of acres of the property.  We would like to incorporate the Acres variable in our model.   Create a new model called “model_3”  that uses all of the data.  Predict the Total value of the home with both the size of the home (SQFT) and the Acres variable.  Produce a summary() command of this new model.

    # Create a new model using both Acres and SQFT as predictors and view the summary
    model3 <- lm(Total ~ Acres + SQFT, data = house)
    
    summary(model3)
    
    Call:
    lm(formula = Total ~ Acres + SQFT, data = house)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -189677  -48727   -5686   30044  448248 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -99554.31   24126.44  -4.126 7.79e-05 ***
    Acres       120884.58    2360.00  51.222  < 2e-16 ***
    SQFT           151.79      12.83  11.832  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 91800 on 97 degrees of freedom
    Multiple R-squared:  0.9658,    Adjusted R-squared:  0.9651 
    F-statistic:  1368 on 2 and 97 DF,  p-value: < 2.2e-16
  7. Examine the R-squared value of this model.  How did this value change from the model that removed the observation?

    The R-squared value is now 0.9658, meaning that 96.58% of the variability in total house price can be explained by the size of the house (SQFT) and the number of acres the house sits on. This is a sharp increase from the 0.6782 of the previous model: the Acres variable, which is highly significant based on its p-value, makes the model far more explanatory than relying on SQFT alone.
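The jump in R-squared when a genuinely informative predictor is added can be reproduced on simulated data. This is a hypothetical sketch (the coefficients only loosely echo the summary above, and the data is generated, not the course's house dataset):

```r
# Hypothetical simulated data: adding an informative predictor (Acres)
# raises R-squared relative to the SQFT-only model.
set.seed(42)
Acres <- runif(100, 0.1, 2.0)
SQFT  <- runif(100, 1000, 4000)
Total <- -100000 + 121000 * Acres + 152 * SQFT + rnorm(100, sd = 90000)
d <- data.frame(Total, Acres, SQFT)

r2_sqft <- summary(lm(Total ~ SQFT, data = d))$r.squared
r2_both <- summary(lm(Total ~ Acres + SQFT, data = d))$r.squared
c(sqft_only = r2_sqft, sqft_and_acres = r2_both)
```

On the same data, R-squared can never decrease when a predictor is added; the adjusted R-squared (0.9651 above) is the version that penalizes uninformative additions.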

  8. We have a home that has a size of  2750 square feet and sits on 0.5 acres of land.  Use your model to predict the value of this home.

    # Predict the price of a new home with 2750 SQFT and 0.5 acres
    new_df <- data.frame(SQFT = 2750, Acres = 0.5)
    predict(model3, newdata = new_df)
           1 
    378310.1 
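The `predict()` result can be verified by hand with the fitted equation, plugging the rounded coefficients from the summary of model3 into the linear formula (the small gap from 378310.1 is due to coefficient rounding):

```r
# Manual check of the predict() call, using the rounded coefficients
# reported by summary(model3)
b0 <- -99554.31     # (Intercept)
b1 <- 120884.58     # Acres
b2 <- 151.79        # SQFT

b0 + b1 * 0.5 + b2 * 2750   # 378310.48, close to predict()'s 378310.1
```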

Self Evaluation

By this point in the course, I was no longer struggling with formatting or syntax. I could focus on analysis—constructing models, interpreting coefficients, and validating assumptions. It was rewarding to realize that I could now perform the same kind of analysis I once did with pen-and-paper formulas, but much faster and at scale. This assignment felt like a fitting conclusion to the course and an exciting starting point for using R in more advanced, real-world projects.


Summary of Growth

Before starting this course, I hadn’t done any coding for nearly two years. Honestly, I was nervous—worried that I might not be able to keep up. However, this class provided me with a fresh perspective on programming. I learned to treat code like a language and the syntax as grammar—something that can be read, written, revised, and understood logically, not just mechanically. That shift in mindset helped me ease back into programming without feeling overwhelmed. One of the things I appreciated most was the structure of the course.

The asynchronous materials were extremely helpful—I often went back to revisit specific demos or instructions, especially when tackling problems like how to use the partykit package in Week 5. Having multiple assignments each week—the practice problems, the live session discussions, and the homework—really reinforced my learning. It kept concepts fresh. For example, after learning how to create maps using ggmap, I didn’t forget the details because we were actively applying it right away, along with using the ggplot function, not just watching it once and moving on.

Another layer of my growth came from learning SQL inside the RStudio environment. At first, switching between R and SQL syntax felt rusty, and I definitely caught myself typing == in SQL queries more than once. But this mistake taught me to pay attention to context and precision in coding. Over time, I’ve become much more confident in debugging and in reading documentation with the help of hints from the console. I’m leaving this course not just knowing more rules, but with a stronger foundation in how to think like a programmer—creative, analytical, and always refining (debugging).
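The == slip mentioned above comes down to one small but important syntax difference. A minimal illustration (the data frame here is hypothetical, just to show the two operators side by side):

```r
# R tests equality with ==, while SQL's WHERE clause uses a single =
df <- data.frame(id = 1:3, city = c("South Bend", "Chicago", "South Bend"))

subset(df, city == "South Bend")   # R: double equals, keeps rows 1 and 3
# Equivalent SQL: SELECT * FROM df WHERE city = 'South Bend'  -- single equals
```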